Peer-to-peer (P2P) lending became a phenomenon less than ten years ago, exploding
in popularity by offering a break from traditional banking. Individuals
flocked to these alternative credit markets for new sources of funding
and for new opportunities to finance their small business ventures.
P2P lending enables individuals to obtain loans directly from other
individuals, cutting out the financial institution as the middleman. It
is also known as "social lending" or "crowd lending." It has only
existed since 2005, but the crowd of competitors already includes
Prosper, LendingClub, Upstart, and StreetShares. P2P lenders are
individual investors who want a better return on their cash savings than
they would get from a bank savings account or certificate of deposit.
P2P borrowers seek an alternative to traditional banks or a lower
interest rate. The default rates for P2P loans, however, are much higher
than those in traditional finance.
P2P lending websites connect borrowers directly to lenders. Each website
sets the rates and the terms and enables the transaction. Most sites
have a wide range of interest rates based on the creditworthiness of the
applicant. First, an investor opens an account with the site and
deposits a sum of money to be dispersed in loans. The loan applicant
posts a financial profile that is assigned a risk category that
determines the interest rate the applicant will pay. The loan applicant
can review offers and accept one. (Some applicants break up their
requests into chunks and accept multiple offers.) The money transfer and
the monthly payments are handled through the platform. The process can
be entirely automated, or lenders and borrowers can choose to haggle.
Some sites specialize in particular types of borrowers. StreetShares,
for example, is designed for small businesses. And LendingClub has a
“Patient Solutions” category that links doctors who offer financing
programs with prospective patients. Peer-to-peer lending is riskier than
a savings account or certificate of deposit, but the interest rates are
often much higher. This is because people who invest in a peer-to-peer
lending site assume most of the risk, which is normally assumed by banks
or other financial institutions. Although direct P2P lending has
undergone changes over recent years, it remains a viable option for
borrowers and investors. The global peer-to-peer lending market was
worth 83.79 billion USD in 2021, according to figures from Precedence
Research. This figure is projected to reach $705.81 billion by 2031. The
simplest way to invest in peer-to-peer lending is to make an account on
a P2P lending site and begin lending money to borrowers. These sites
typically let the lender choose the profile of their borrowers, so they
can choose between high risk/high returns or more modest returns.
Alternatively, many P2P lending sites are public companies, so one can
also invest in them by buying their stock.
LendingClub is a financial services company headquartered in San
Francisco, California. It was the first peer-to-peer lender to register
its offerings as securities with the Securities and Exchange Commission
(SEC) and to offer loan trading on a secondary market. LendingClub
enabled borrowers to create unsecured personal loans between 1,000 USD
and 40,000 USD. The standard loan period was three years. Investors were
able to search and browse the loan listings on the LendingClub website
and select loans to invest in based on the information supplied about
the borrower, the loan amount, the loan grade, and the loan purpose.
Investors made money from the interest on these loans. LendingClub made
money by charging borrowers an origination fee and investors a service
fee. LendingClub screens potential borrowers and services the loans
once they’re approved. The risk: Investors – not LendingClub – make the
final decision whether or not to lend the money. That decision is based
on the LendingClub grade, utilizing credit and income data, assigned to
every approved borrower. That data, known only to the investors, also
helps determine the range of interest rates offered to the borrower.
LendingClub’s typical annual percentage rate (APR) is between 5.99% and
35.89%. There is also an origination fee of 1% to 6% taken off the top
of the loan. Once approved, your loan amount will arrive at your bank
account in about one week. There’s a monthly repayment schedule that
stretches over three to five years (36-60 monthly payments). LendingClub
loans are generally pursued by borrowers with good-to-excellent credit
(scores average 700) and a low debt-to-income ratio (the average is
12%). Borrowers can file a joint application, which could lead to a
larger loan line because of multiple incomes. LendingClub probably isn’t
the best option for borrowers with bad credit. That would bring a high
interest rate and steep origination fee, meaning you could probably do
better with a different type of loan.
Here, we are seeking to understand the factors that might have signaled risky loans or borrowing practices and that could be applied by prospective borrowers, lenders, and/or investors considering participating in direct P2P lending via LendingClub.
Our dataset contains over 9,500 observations of loan data from LendingClub, the largest online platform for direct P2P lending. We obtained the dataset from Kaggle: [https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis]
P2P lending rose to popularity in the years 2007 to 2015, as can be seen in the timeline:
So, we believe that the timeframe of 2007 to 2015 provides the most relevant data for prospective individual investors today, particularly because it is unlikely to include a significant number of large institutional lenders.
In our dataset the variables are defined as follows:
# This is copied from the Kaggle site
# We will use a kable table for simplicity
data_definitions <- data.frame(variable = c("credit.policy", "purpose", "int.rate", "installment", "log.annual.inc", "dti", "fico", "days.with.cr.line", "revol.bal", "revol.util", "inq.last.6mths", "delinq.2yrs", "pub.rec", "not.fully.paid"),
definition = c("1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.",
"The purpose of the loan (takes values creditcard, debtconsolidation, educational, majorpurchase, smallbusiness, and all_other).",
"The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.",
"The monthly installments owed by the borrower if the loan is funded.",
"The natural log of the self-reported annual income of the borrower.",
"The debt-to-income ratio of the borrower (amount of debt divided by annual income).",
"The FICO credit score of the borrower.",
"The number of days the borrower has had a credit line.",
"The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).",
"The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).",
"The borrower's number of inquiries by creditors in the last 6 months.",
"The number of times the borrower had been 30+ days past due on a payment in the past 2 years.",
"The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).",
"Whether the borrower will be fully paid or not."))
knitr::kable(data_definitions)
| variable | definition |
|---|---|
| credit.policy | 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise. |
| purpose | The purpose of the loan (takes values creditcard, debtconsolidation, educational, majorpurchase, smallbusiness, and all_other). |
| int.rate | The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates. |
| installment | The monthly installments owed by the borrower if the loan is funded. |
| log.annual.inc | The natural log of the self-reported annual income of the borrower. |
| dti | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
| fico | The FICO credit score of the borrower. |
| days.with.cr.line | The number of days the borrower has had a credit line. |
| revol.bal | The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle). |
| revol.util | The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available). |
| inq.last.6mths | The borrower’s number of inquiries by creditors in the last 6 months. |
| delinq.2yrs | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
| pub.rec | The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments). |
| not.fully.paid | Whether the borrower will be fully paid or not. |
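Since log.annual.inc is stored on the natural-log scale, exp() recovers the dollar amount. As a quick illustration (the 11.3504 value is the log income of the first borrower we will see when previewing the data):

```r
# log.annual.inc = ln(annual income), so exp() converts back to dollars
exp(11.3504) # about 85,000 USD of self-reported annual income
```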
As with any analysis, ours will start with the formulation of the
SMART question and subsequent EDA. By SMART question we mean a problem
statement arising from our dataset that has the following
characteristics:

1. Specific
2. Measurable
3. Achievable
4. Relevant
5. Time-Oriented
To help us structure our EDA in an efficient way, we have decided that
our exploratory data analysis will closely adhere to the 9-step
checklist presented in Chapter 4 of The Art of Data Science.
These are the elements of our checklist:
Our analysis will explore income-to-debt ratios, credit scores, interest rates, and delinquencies among direct P2P borrowers in an attempt to understand the risks and opportunities associated with P2P lending. Specifically, we intend to examine the impact that these variables had on who received loans and who defaulted on their loans between 2007 and 2015. We also intend to depict graphically how the variables are related to each other. Our primary intention is to understand the dataset and answer a few of the following questions:

1. What variable or variables, if any, have an impact on whether the person meets the credit underwriting criteria? How strong is that impact?
2. What variable or variables, if any, have an impact on whether the person fully repays the loan? How strong is that impact?
3. Do borrowers who meet the credit underwriting criteria have a lower chance of not fully repaying the loan? If so, how big is the difference, and is it statistically significant?
We start by loading the tidyverse and ezids libraries and reading in our dataset. Please note we also require the corrplot and scales packages to efficiently conduct our EDA. If we require any other packages, we will import them along the way.
library(ezids) # We will use functions from this package to get nicer looking results
library(tidyverse) # We need this package for data manipulation, piping, and graphing
library(corrplot) # We will need this package to plot the correlation matrix
library(scales) # This package will help us when labeling the scatter plots
library(gridExtra) # For additional table and image functionality
# Read in the data from the working directory
loans <- read_csv("loan_data.csv")
#loans
We can see that our dataset contains 9578 rows of data with 14 columns. Next let’s examine the structure of the dataset. This will give us a better understanding of what we are dealing with. This is essentially a comprehensive preview of the type of the variables in our data and one of the introductory steps of any EDA.
# There is unfortunately no ezids function to see the result in a nice looking table, so we will use the standard function.
str(loans)
## spec_tbl_df [9,578 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ credit.policy : num [1:9578] 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : chr [1:9578] "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
## $ int.rate : num [1:9578] 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num [1:9578] 829 228 367 162 103 ...
## $ log.annual.inc : num [1:9578] 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num [1:9578] 19.5 14.3 11.6 8.1 15 ...
## $ fico : num [1:9578] 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num [1:9578] 5640 2760 4710 2700 4066 ...
## $ revol.bal : num [1:9578] 28854 33623 3511 33667 4740 ...
## $ revol.util : num [1:9578] 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : num [1:9578] 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : num [1:9578] 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : num [1:9578] 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : num [1:9578] 0 0 0 0 0 0 1 1 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. credit.policy = col_double(),
## .. purpose = col_character(),
## .. int.rate = col_double(),
## .. installment = col_double(),
## .. log.annual.inc = col_double(),
## .. dti = col_double(),
## .. fico = col_double(),
## .. days.with.cr.line = col_double(),
## .. revol.bal = col_double(),
## .. revol.util = col_double(),
## .. inq.last.6mths = col_double(),
## .. delinq.2yrs = col_double(),
## .. pub.rec = col_double(),
## .. not.fully.paid = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
By examining the structure of our data, we can see that there is only one character variable which might be a factor, and some of the numeric variables look like logicals.
We can also look at the top and bottom rows of our dataset to get a better feel for the data. This will help us better understand the values in our dataset and how to most effectively deal with them.
The first 5 observations of our dataset are as follows:
# We use the xkabledplyhead() function from the ezids package to see the result in a nice looking table.
xkabledplyhead(loans)
| credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | debt_consolidation | 0.1189 | 829.10 | 11.3504 | 19.48 | 737 | 5639.958 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 1 | credit_card | 0.1071 | 228.22 | 11.0821 | 14.29 | 707 | 2760.000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 1 | debt_consolidation | 0.1357 | 366.86 | 10.3735 | 11.63 | 682 | 4710.000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 1 | debt_consolidation | 0.1008 | 162.34 | 11.3504 | 8.10 | 712 | 2699.958 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 1 | credit_card | 0.1426 | 102.92 | 11.2997 | 14.97 | 667 | 4066.000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
The bottom 5 observations of our dataset are as follows:
# We use the xkabledplytail() function from the ezids package to see the result in a nice looking table.
xkabledplytail(loans)
| credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all_other | 0.1461 | 344.76 | 12.1808 | 10.39 | 672 | 10474.000 | 215372 | 82.1 | 2 | 0 | 0 | 1 |
| 0 | all_other | 0.1253 | 257.70 | 11.1419 | 0.21 | 722 | 4380.000 | 184 | 1.1 | 5 | 0 | 0 | 1 |
| 0 | debt_consolidation | 0.1071 | 97.81 | 10.5966 | 13.09 | 687 | 3450.042 | 10036 | 82.9 | 8 | 0 | 0 | 1 |
| 0 | home_improvement | 0.1600 | 351.58 | 10.8198 | 19.18 | 692 | 1800.000 | 0 | 3.2 | 5 | 0 | 0 | 1 |
| 0 | debt_consolidation | 0.1392 | 853.43 | 11.2645 | 16.28 | 732 | 4740.000 | 37879 | 57.0 | 6 | 0 | 0 | 1 |
The top and bottom rows of our dataset indicate the data is structured in an acceptable way and that our variables match up with the values for each column. This means we can now move on to the descriptive statistics part of our EDA.
We can get some descriptive statistics of the variables to help us better understand the data. The xkablesummary() function gives us the five-number summary of the dataset plus the mean.
# We use the xkablesummary() function from the ezids package to see the result in a nice looking table.
xkablesummary(loans)
| credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Min. :0.000 | Length:9578 | Min. :0.0600 | Min. : 15.67 | Min. : 7.548 | Min. : 0.000 | Min. :612.0 | Min. : 179 | Min. : 0 | Min. : 0.0 | Min. : 0.000 | Min. : 0.0000 | Min. :0.00000 | Min. :0.0000 |
| Q1 | 1st Qu.:1.000 | Class :character | 1st Qu.:0.1039 | 1st Qu.:163.77 | 1st Qu.:10.558 | 1st Qu.: 7.213 | 1st Qu.:682.0 | 1st Qu.: 2820 | 1st Qu.: 3187 | 1st Qu.: 22.6 | 1st Qu.: 0.000 | 1st Qu.: 0.0000 | 1st Qu.:0.00000 | 1st Qu.:0.0000 |
| Median | Median :1.000 | Mode :character | Median :0.1221 | Median :268.95 | Median :10.929 | Median :12.665 | Median :707.0 | Median : 4140 | Median : 8596 | Median : 46.3 | Median : 1.000 | Median : 0.0000 | Median :0.00000 | Median :0.0000 |
| Mean | Mean :0.805 | NA | Mean :0.1226 | Mean :319.09 | Mean :10.932 | Mean :12.607 | Mean :710.8 | Mean : 4561 | Mean : 16914 | Mean : 46.8 | Mean : 1.577 | Mean : 0.1637 | Mean :0.06212 | Mean :0.1601 |
| Q3 | 3rd Qu.:1.000 | NA | 3rd Qu.:0.1407 | 3rd Qu.:432.76 | 3rd Qu.:11.291 | 3rd Qu.:17.950 | 3rd Qu.:737.0 | 3rd Qu.: 5730 | 3rd Qu.: 18250 | 3rd Qu.: 70.9 | 3rd Qu.: 2.000 | 3rd Qu.: 0.0000 | 3rd Qu.:0.00000 | 3rd Qu.:0.0000 |
| Max | Max. :1.000 | NA | Max. :0.2164 | Max. :940.14 | Max. :14.528 | Max. :29.960 | Max. :827.0 | Max. :17640 | Max. :1207359 | Max. :119.0 | Max. :33.000 | Max. :13.0000 | Max. :5.00000 | Max. :1.0000 |
From this we can see that some of the variables that appeared to be logicals, like inq.last.6mths, delinq.2yrs, and pub.rec, are actually not. However, credit.policy and not.fully.paid are.
We have an idea of what to expect for a few variables, such as interest rate and credit score, so we were able to test the dataset against some of our expectations to gauge its reliability. By inspecting the dataframe, we can see that interest rates are between 6% and 21.64% and credit scores range from 612 to 827. Although some interest rates might seem excessively high and some credit scores too meager, the P2P market tended to consist of riskier loans. This aligned with our expectations and reinforced our confidence in the dataset.
The range of the utilization, or the percent of credit being used, is between 0% and 119%. Someone utilizing more than 100% of the credit available to them initially seemed erroneous; however, this can occur from technical error, creditors and collectors reporting at different date/times, borrowers opening and closing credit lines, or possibly when borrowers appear as authorized users of others’ credit lines. Regardless, only 27 loans within our dataset appear to exceed the standard maximum of 100% so we do not expect this to have a significant effect on our analysis, thereby allowing us to move on to the next step of our EDA.
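The 27-loan count above can be reproduced with a one-line check (a sketch, assuming the loans data frame read in earlier):

```r
# How many borrowers report using more than 100% of their available credit?
sum(loans$revol.util > 100)
```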
But before we move on, we want to see a measure of dispersion/variation, namely the standard deviation, for the numeric variables as part of the descriptive statistics section of our EDA. The standard deviations are as follows:
library(expss)
tab = loans %>%
tab_cells(credit.policy,int.rate, installment, log.annual.inc, dti, fico ,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid) %>%
tab_cols() %>%
tab_stat_sd(label = "Std. dev.") %>%
tab_pivot() %>%
set_caption("Table with Standard Deviation of all numeric variables.")
tab
Table with Standard Deviation of all numeric variables.

| variable | Std. dev. |
|---|---|
| credit.policy | 0.4 |
| int.rate | 0.0 |
| installment | 207.1 |
| log.annual.inc | 0.6 |
| dti | 6.9 |
| fico | 38.0 |
| days.with.cr.line | 2496.9 |
| revol.bal | 33756.2 |
| revol.util | 29.0 |
| inq.last.6mths | 2.2 |
| delinq.2yrs | 0.5 |
| pub.rec | 0.3 |
| not.fully.paid | 0.4 |
png("sd.png", height=700, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
According to the Kaggle site where we got this dataset, there are 9,578 rows and 14 columns, which matches what we have. The site also shows that there is no missing data. Let’s verify that by summing the total number of missing cells in the dataset (which is 0) and checking the total number of null cells (also 0). We can also check whether the observations are unique, and we see that all 9,578 rows are unique.
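These checks can be sketched in a couple of lines (assuming the loans data frame from our earlier chunk; distinct() comes from dplyr, loaded with the tidyverse):

```r
# Total missing cells across the whole data frame (the site says 0)
sum(is.na(loans))
# TRUE if every observation is unique (no duplicated rows)
nrow(distinct(loans)) == nrow(loans)
```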
Let’s start by making a histogram for each non-logical numeric variable.
# By gathering the variables we want to see into a long format with the gather() function, we can then create a histogram
# for each variable using the facet_wrap() function in ggplot2.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_histogram(fill = "steelblue", color = "black") +
facet_wrap(~ variable, scales = "free") + # Free scales so the graphs are readable
labs(title = "Histograms of Numeric Variables", x = "Value", y = "Count") +
theme_minimal()
Some of these variables look somewhat normal, and it would make sense to create a QQ-Plot for them later. But first let’s create boxplots for these same variables.
# By gathering the variables we want to see into a long format with the gather() function, we can then create a boxplot
# for each variable using the facet_wrap() function in ggplot2.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_boxplot(fill = "steelblue", color = "black",
outlier.size = 2, outlier.alpha = 0.2) + # Translucent and larger outliers to help with overplotting
facet_wrap(~ variable, scales = "free") + # Free scales so the graphs are readable
labs(title = "Boxplots of Numeric Variables", x = "Value") +
theme_minimal() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
We can see that some of these variables have issues with outliers.
Now let’s look at the factor and logical variables with bar charts.
# By gathering the variables we want to see into a long format with the gather() function, we can then create a bar graph
# for each variable using the facet_wrap() function in ggplot2.
loans %>%
select(credit.policy, purpose, not.fully.paid) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_bar(fill = "steelblue", color = "black") +
facet_wrap(~ variable, scales = "free") + # Free scales so the graphs are readable
coord_flip() +
labs(title = "Bar Charts of Non-Numeric Variables", x = "Value", y = "Count") +
theme_minimal() +
theme()
From this we can see that the purpose variable would be
a good candidate to perform ANOVA tests on.
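A minimal sketch of such a test, assuming int.rate as the response (our choice here, since rates plausibly differ by loan purpose) and base R's aov() for the one-way ANOVA:

```r
# One-way ANOVA: does the mean interest rate differ across loan purposes?
purpose_anova <- aov(int.rate ~ purpose, data = loans)
summary(purpose_anova) # the F-test p-value indicates whether any group means differ
```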
Let’s take the easy route first and compare meeting the credit underwriting criteria against the borrower fully paying. We can group by the credit.policy variable and calculate the percentage of borrowers in each category who did not fully pay.
# We will convert the average to an easier to read percentage by multiplying by 100, rounding, and adding a "%" at the end.
loans %>%
group_by(credit.policy) %>%
summarize(percent_not_fully_paid = paste0(round(100*mean(not.fully.paid), 1), "%")) %>%
ungroup() %>%
knitr::kable(align = "c")
| credit.policy | percent_not_fully_paid |
|---|---|
| 0 | 27.8% |
| 1 | 13.2% |
From this we can see that about 13.2% of borrowers who met the credit underwriting criteria did not fully pay, while for the borrowers who did not meet the credit underwriting criteria about 27.8% did not fully pay.
This indicates borrowers who did not meet the credit underwriting criteria were almost twice as likely to default on their loans as those who did meet the criteria. For comparison, default rates on loans from commercial banks over the same period as our dataset averaged 4.48%, with a maximum of 7.49% toward the end of 2009, according to the St. Louis Federal Reserve Bank.
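Our third SMART question asks whether this difference is statistically significant; a two-proportion test is one way to check. A sketch, assuming the loans data frame (prop.test() compares the not-fully-paid share between the two credit.policy groups):

```r
# 2x2 table: credit.policy (rows) by not.fully.paid (columns)
policy_vs_paid <- table(loans$credit.policy, loans$not.fully.paid)
# Two-sample test for equality of proportions; a small p-value would
# suggest the gap in repayment rates between the groups is significant
prop.test(policy_vs_paid)
```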
Let us go back to the table of data definitions and add a column for the variable type.
# Add a type column and reorder it so that definition is last
data_definitions_augmented <- data_definitions %>%
mutate(type = c("Logical", "Factor", "Numeric", "Numeric", "Numeric", "Numeric", "Integer", "Numeric", "Integer", "Numeric", "Integer", "Integer", "Integer", "Logical")) %>%
select(variable, type, definition)
# We will use a kable table for simplicity
knitr::kable(data_definitions_augmented)
| variable | type | definition |
|---|---|---|
| credit.policy | Logical | 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise. |
| purpose | Factor | The purpose of the loan (takes values creditcard, debtconsolidation, educational, majorpurchase, smallbusiness, and all_other). |
| int.rate | Numeric | The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates. |
| installment | Numeric | The monthly installments owed by the borrower if the loan is funded. |
| log.annual.inc | Numeric | The natural log of the self-reported annual income of the borrower. |
| dti | Numeric | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
| fico | Integer | The FICO credit score of the borrower. |
| days.with.cr.line | Numeric | The number of days the borrower has had a credit line. |
| revol.bal | Integer | The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle). |
| revol.util | Numeric | The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available). |
| inq.last.6mths | Integer | The borrower’s number of inquiries by creditors in the last 6 months. |
| delinq.2yrs | Integer | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
| pub.rec | Integer | The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments). |
| not.fully.paid | Logical | Whether the borrower will be fully paid or not. |
Let’s also convert some of the variables to a more appropriate type: the credit.policy and not.fully.paid variables are logicals, and the purpose variable we will use as a factor.
# These variables may act differently from here on out
loans$credit.policy <- as.logical(loans$credit.policy)
loans$not.fully.paid <- as.logical(loans$not.fully.paid)
loans$purpose <- as.factor(loans$purpose)
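We can quickly confirm the conversions took effect; applying class() over the three columns with sapply() is one simple way:

```r
# Each converted column should now report its new class
sapply(loans[c("credit.policy", "not.fully.paid", "purpose")], class)
```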
Now we can further explore the data to see how the numeric variables differ based on the credit.policy, not.fully.paid, and purpose variables. Let’s make some boxplots to visualize this.
# By gathering the variables we want to see into a long format with the gather() function, we can then create a boxplot
# for each variable using the facet_wrap() function in ggplot2. We can see this for each credit policy value by excluding
# it in the gather() function.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec, credit.policy) %>%
gather(variable, value, -credit.policy) %>%
ggplot(aes(x = value, y = as.logical(credit.policy), fill = as.logical(credit.policy))) +
geom_boxplot(outlier.size = 2, outlier.alpha = 0.2) + # Translucent and larger outliers to help with overplotting
guides(fill = guide_legend(reverse = TRUE)) + # So the legend order matches the order in the graphs
facet_wrap(~ variable, scales = "free_x") + # Free x scale so the graphs are readable
labs(title = "Boxplots of Numeric Variables", subtitle = "Comparing `credit.policy` Values",
x = "Value", y = "Count", fill = "Credit Policy") +
theme_minimal()
# By gathering the variables we want to see into a long format with the gather() function, we can then create a boxplot
# for each variable using the facet_wrap() function in ggplot2. We can see this for each not fully paid value by excluding
# it in the gather() function.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec, not.fully.paid) %>%
gather(variable, value, -not.fully.paid) %>%
ggplot(aes(x = value, y = as.logical(not.fully.paid), fill = as.logical(not.fully.paid))) +
geom_boxplot(outlier.size = 2, outlier.alpha = 0.2) + # Translucent and larger outliers to help with overplotting
guides(fill = guide_legend(reverse = TRUE)) + # So the legend order matches the order in the graphs
facet_wrap(~ variable, scales = "free_x") + # Free x scale so the graphs are readable
labs(title = "Boxplots of Numeric Variables", subtitle = "Comparing `not.fully.paid` Values",
x = "Value", y = "Count", fill = "Not Fully Paid") +
theme_minimal()
# By gathering the variables we want to see into a long format with the gather() function, we can then create a boxplot
# for each variable using the facet_wrap() function in ggplot2. We can see this for each purpose value by excluding
# it in the gather() function.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec, purpose) %>%
gather(variable, value, -purpose) %>%
ggplot(aes(x = value, y = purpose, fill = purpose)) +
geom_boxplot(outlier.size = 2, outlier.alpha = 0.2) +
guides(fill = guide_legend(reverse = TRUE)) + # So the legend order matches the order in the graphs
facet_wrap(~ variable, scales = "free_x") + # Free x scale so the graphs are readable
labs(title = "Boxplots of Numeric Variables", subtitle = "Comparing `purpose` Values",
x = "Value", y = "Count", fill = "Purpose") +
theme_minimal()
# For our covariance matrix we want to include everything but the purpose variable
loans_covarience_matrix <- loans %>%
select(-purpose) %>%
cov()
loans_covarience_matrix
## credit.policy int.rate installment log.annual.inc
## credit.policy 1.570099e-01 -0.0031285128 4.822100e+00 8.503673e-03
## int.rate -3.128513e-03 0.0007207607 1.535130e+00 9.306428e-04
## installment 4.822100e+00 1.5351296712 4.287852e+04 5.704792e+01
## log.annual.inc 8.503673e-03 0.0009306428 5.704792e+01 3.779947e-01
## dti -2.479528e-01 0.0406600857 7.156135e+01 -2.288211e-01
## fico 5.240672e+00 -0.7286843825 6.764941e+02 2.674749e+00
## days.with.cr.line 9.797606e+01 -8.3138330083 9.477258e+04 5.171847e+02
## revol.bal -2.508193e+03 83.8528268415 1.633027e+06 7.723287e+03
## revol.util -1.196760e+00 0.3620848507 4.887925e+02 9.789922e-01
## inq.last.6mths -4.668777e-01 0.0119782213 -4.746828e+00 3.946114e-02
## delinq.2yrs -1.651796e-02 0.0022887736 -4.940054e-01 9.807039e-03
## pub.rec -5.634017e-03 0.0006907971 -1.778157e+00 2.660161e-03
## not.fully.paid -2.297364e-02 0.0015706471 3.792995e+00 -7.538466e-03
## dti fico days.with.cr.line revol.bal
## credit.policy -2.479528e-01 5.240672e+00 9.797606e+01 -2.508193e+03
## int.rate 4.066009e-02 -7.286844e-01 -8.313833e+00 8.385283e+01
## installment 7.156135e+01 6.764941e+02 9.477258e+04 1.633027e+06
## log.annual.inc -2.288211e-01 2.674749e+00 5.171847e+02 7.723287e+03
## dti 4.738904e+01 -6.304443e+01 1.033066e+03 4.386056e+04
## fico -6.304443e+01 1.441762e+03 2.501838e+04 -1.993428e+04
## days.with.cr.line 1.033066e+03 2.501838e+04 6.234661e+06 1.933070e+07
## revol.bal 4.386056e+04 -1.993428e+04 1.933070e+07 1.139480e+09
## revol.util 6.733229e+01 -5.963347e+02 -1.756061e+03 1.995845e+05
## inq.last.6mths 4.421091e-01 -1.548021e+01 -2.292940e+02 1.663280e+03
## delinq.2yrs -8.194136e-02 -4.486898e+00 1.109825e+02 -6.129401e+02
## pub.rec 1.120352e-02 -1.468994e+00 4.701103e+01 -2.743852e+02
## not.fully.paid 9.430733e-02 -2.083784e+00 -2.676802e+01 6.646676e+02
## revol.util inq.last.6mths delinq.2yrs pub.rec
## credit.policy -1.196760e+00 -0.46687768 -1.651796e-02 -5.634017e-03
## int.rate 3.620849e-01 0.01197822 2.288774e-03 6.907971e-04
## installment 4.887925e+02 -4.74682830 -4.940054e-01 -1.778157e+00
## log.annual.inc 9.789922e-01 0.03946114 9.807039e-03 2.660161e-03
## dti 6.733229e+01 0.44210915 -8.194136e-02 1.120352e-02
## fico -5.963347e+02 -15.48020939 -4.486898e+00 -1.468994e+00
## days.with.cr.line -1.756061e+03 -229.29404099 1.109825e+02 4.701103e+01
## revol.bal 1.995845e+05 1663.28033876 -6.129401e+02 -2.743852e+02
## revol.util 8.418364e+02 -0.88607632 -6.773480e-01 5.074089e-01
## inq.last.6mths -8.860763e-01 4.84107945 2.553287e-02 4.191352e-02
## delinq.2yrs -6.773480e-01 0.02553287 2.983507e-01 1.314967e-03
## pub.rec 5.074089e-01 0.04191352 1.314967e-03 6.871021e-02
## not.fully.paid 8.733217e-01 0.12057426 1.778727e-03 4.674501e-03
## not.fully.paid
## credit.policy -0.022973644
## int.rate 0.001570647
## installment 3.792994623
## log.annual.inc -0.007538466
## dti 0.094307333
## fico -2.083784075
## days.with.cr.line -26.768023630
## revol.bal 664.667576358
## revol.util 0.873321666
## inq.last.6mths 0.120574263
## delinq.2yrs 0.001778727
## pub.rec 0.004674501
## not.fully.paid 0.134450952
png("loans_covarience_matrix.png", height=2000, width=2000)
p<-tableGrob(loans_covarience_matrix)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
# For our correlation matrix we want to include everything but the purpose variable
loans_correlation_matrix <- loans %>%
select(-purpose) %>%
cor()
loans_correlation_matrix
png("loans_correlation_matrix.png", height=2000, width=2000)
p<-tableGrob(loans_correlation_matrix)
grid.arrange(p)
dev.off()
# The mixed correlation plot makes a nice visualization
corrplot.mixed(loans_correlation_matrix, tl.pos = "lt")
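As an aside on why both matrices are worth computing: the correlation matrix is just the covariance matrix rescaled by each variable's standard deviation, which makes it comparable across variables measured in different units. The base-R helper `cov2cor()` performs exactly that rescaling. A minimal self-contained illustration on simulated data (not the loans data):

```r
# Simulated data: c is constructed to be correlated with a
set.seed(42)
d <- data.frame(a = rnorm(100), b = rnorm(100))
d$c <- d$a + rnorm(100, sd = 0.5)

cv <- cov(d)  # scale-dependent
cr <- cor(d)  # standardized to [-1, 1]

# cov2cor() rescales a covariance matrix into the correlation matrix
all.equal(cov2cor(cv), cr)  # should be TRUE (up to floating-point tolerance)
```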
loans %>%
ggplot(aes(x = fico, y = int.rate)) +
geom_point(color = "steelblue", alpha = 0.2) +
labs(title = "Interest Rate vs FICO Score",
x = "FICO Score", y = "Interest Rate") +
scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
scale_y_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
theme_minimal()
loans %>%
ggplot(aes(x = int.rate, y = revol.util)) +
geom_point(color = "steelblue", alpha = 0.2) +
labs(title = "Revolving Line Utilization Rate vs Interest Rate",
x = "Interest Rate", y = "Revolving Line Utilization Rate") +
scale_x_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
scale_y_continuous(labels = label_percent(scale = 1)) +
theme_minimal()
loans %>%
ggplot(aes(x = log.annual.inc, y = installment)) +
geom_point(color = "steelblue", alpha = 0.2) +
labs(title = "Installment vs Log of Annual Income",
x = "Log of Annual Income", y = "Installment") +
theme_minimal()
loans %>%
ggplot(aes(x = fico, y = revol.util)) +
geom_point(color = "steelblue", alpha = 0.2) +
labs(title = "Revolving Line Utilization Rate vs FICO Score",
x = "FICO Score", y = "Revolving Line Utilization Rate") +
scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
scale_y_continuous(labels = label_percent(scale = 1)) +
theme_minimal()
## T-tests:
A t-test is a statistical test that compares the means of two samples. In hypothesis-testing terms, the null hypothesis is that the difference in group means is zero, and the alternative is that it is nonzero. Below we run one-sample t-tests on each numeric variable in our dataset; these test whether each variable's mean differs from zero, so very small p-values are expected. Not all of the results may be useful, but we want them available for further consideration.
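The paragraph above describes the two-sample case, while the tests that follow are one-sample. As a minimal sketch of the two-sample form, the snippet below compares mean interest rates across the `not.fully.paid` groups. It uses simulated data (with an assumed group difference) so it runs on its own; with the real data the equivalent call would be `t.test(int.rate ~ not.fully.paid, data = loans)`.

```r
# Two-sample t-test sketch on simulated data (NOT the real loans data):
# do fully-paid and not-fully-paid loans have different mean interest rates?
set.seed(7)
demo <- data.frame(
  int.rate       = c(rnorm(200, mean = 0.12, sd = 0.025),   # "paid" group
                     rnorm(60,  mean = 0.15, sd = 0.025)),  # "not fully paid" group
  not.fully.paid = factor(rep(c(0, 1), c(200, 60)))
)

tt <- t.test(int.rate ~ not.fully.paid, data = demo)
tt$p.value  # a small p-value rejects the null of equal group means
```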
library(broom)
library(purrr)
ttest95rate = t.test(x=loans$int.rate) # default conf.level = 0.95
ttest99rate = t.test(x=loans$int.rate, conf.level=0.99 )
ttest50rate = t.test(x=loans$int.rate, conf.level=0.50 )
tab <- map_df(list(ttest95rate, ttest99rate, ttest50rate),tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.123 447. 0 9577 0.122 0.123 One Sample t-… two.si…
## 2 0.123 447. 0 9577 0.122 0.123 One Sample t-… two.si…
## 3 0.123 447. 0 9577 0.122 0.123 One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t1.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95installment = t.test(x=loans$installment) # default conf.level = 0.95
ttest99installment = t.test(x=loans$installment, conf.level=0.99 )
ttest50installment = t.test(x=loans$installment, conf.level=0.50 )
tab <- map_df(list(ttest95installment,ttest99installment,ttest50installment), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 319. 151. 0 9577 315. 323. One Sample t-… two.si…
## 2 319. 151. 0 9577 314. 325. One Sample t-… two.si…
## 3 319. 151. 0 9577 318. 321. One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t2.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95annual = t.test(x=loans$log.annual.inc) # default conf.level = 0.95
ttest99annual = t.test(x=loans$log.annual.inc, conf.level=0.99 )
ttest50annual = t.test(x=loans$log.annual.inc, conf.level=0.50 )
tab <- map_df(list(ttest95annual,ttest99annual,ttest50annual), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 10.9 1740. 0 9577 10.9 10.9 One Sample t-… two.si…
## 2 10.9 1740. 0 9577 10.9 10.9 One Sample t-… two.si…
## 3 10.9 1740. 0 9577 10.9 10.9 One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t3.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95fico = t.test(x=loans$fico) # default conf.level = 0.95
ttest99fico = t.test(x=loans$fico, conf.level=0.99 )
ttest50fico = t.test(x=loans$fico, conf.level=0.50 )
tab <- map_df(list(ttest95fico,ttest99fico,ttest50fico), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 711. 1832. 0 9577 710. 712. One Sample t-… two.si…
## 2 711. 1832. 0 9577 710. 712. One Sample t-… two.si…
## 3 711. 1832. 0 9577 711. 711. One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t4.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95dti = t.test(x=loans$dti) # default conf.level = 0.95
ttest99dti = t.test(x=loans$dti, conf.level=0.99 )
ttest50dti = t.test(x=loans$dti, conf.level=0.50 )
tab <- map_df(list(ttest95dti,ttest99dti,ttest50dti), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 12.6 179. 0 9577 12.5 12.7 One Sample t-… two.si…
## 2 12.6 179. 0 9577 12.4 12.8 One Sample t-… two.si…
## 3 12.6 179. 0 9577 12.6 12.7 One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t5.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95days.with.cr.line = t.test(x=loans$days.with.cr.line) # default conf.level = 0.95
ttest99days.with.cr.line = t.test(x=loans$days.with.cr.line, conf.level=0.99 )
ttest50days.with.cr.line = t.test(x=loans$days.with.cr.line, conf.level=0.50 )
tab <- map_df(list(ttest95days.with.cr.line,ttest99days.with.cr.line,ttest50days.with.cr.line), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 4561. 179. 0 9577 4511. 4611. One Sample t-… two.si…
## 2 4561. 179. 0 9577 4495. 4626. One Sample t-… two.si…
## 3 4561. 179. 0 9577 4544. 4578. One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t6.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95revol.bal = t.test(x=loans$revol.bal) # default conf.level = 0.95
ttest99revol.bal = t.test(x=loans$revol.bal, conf.level=0.99 )
ttest50revol.bal = t.test(x=loans$revol.bal, conf.level=0.50 )
tab <- map_df(list(ttest95revol.bal,ttest99revol.bal,ttest50revol.bal), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 16914. 49.0 0 9577 16238. 17590. One Sample t-… two.si…
## 2 16914. 49.0 0 9577 16025. 17803. One Sample t-… two.si…
## 3 16914. 49.0 0 9577 16681. 17147. One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t7.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95revol.util = t.test(x=loans$revol.util) # default conf.level = 0.95
ttest99revol.util = t.test(x=loans$revol.util, conf.level=0.99 )
ttest50revol.util = t.test(x=loans$revol.util, conf.level=0.50 )
tab <- map_df(list(ttest95revol.util,ttest99revol.util,ttest50revol.util), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 46.8 158. 0 9577 46.2 47.4 One Sample t-… two.si…
## 2 46.8 158. 0 9577 46.0 47.6 One Sample t-… two.si…
## 3 46.8 158. 0 9577 46.6 47.0 One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t8.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95inq.last.6mths = t.test(x=loans$inq.last.6mths) # default conf.level = 0.95
ttest99inq.last.6mths = t.test(x=loans$inq.last.6mths, conf.level=0.99 )
ttest50inq.last.6mths = t.test(x=loans$inq.last.6mths, conf.level=0.50 )
tab <- map_df(list(ttest95inq.last.6mths,ttest99inq.last.6mths,ttest50inq.last.6mths), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1.58 70.2 0 9577 1.53 1.62 One Sample t-… two.si…
## 2 1.58 70.2 0 9577 1.52 1.64 One Sample t-… two.si…
## 3 1.58 70.2 0 9577 1.56 1.59 One Sample t-… two.si…
## # … with abbreviated variable name ¹alternative
png("t9.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95delinq.2yrs = t.test(x=loans$delinq.2yrs) # default conf.level = 0.95
ttest99delinq.2yrs = t.test(x=loans$delinq.2yrs, conf.level=0.99 )
ttest50delinq.2yrs = t.test(x=loans$delinq.2yrs, conf.level=0.50 )
tab <- map_df(list(ttest95delinq.2yrs,ttest99delinq.2yrs,ttest50delinq.2yrs), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.164 29.3 3.51e-181 9577 0.153 0.175 One Sample … two.si…
## 2 0.164 29.3 3.51e-181 9577 0.149 0.178 One Sample … two.si…
## 3 0.164 29.3 3.51e-181 9577 0.160 0.167 One Sample … two.si…
## # … with abbreviated variable name ¹alternative
png("t10.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ttest95pub.rec = t.test(x=loans$pub.rec) # default conf.level = 0.95
ttest99pub.rec = t.test(x=loans$pub.rec, conf.level=0.99 )
ttest50pub.rec = t.test(x=loans$pub.rec, conf.level=0.50 )
tab <- map_df(list(ttest95pub.rec,ttest99pub.rec,ttest50pub.rec), tidy)
tab
## # A tibble: 3 × 8
## estimate statistic p.value parameter conf.low conf.high method alter…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.0621 23.2 7.89e-116 9577 0.0569 0.0674 One Sample … two.si…
## 2 0.0621 23.2 7.89e-116 9577 0.0552 0.0690 One Sample … two.si…
## 3 0.0621 23.2 7.89e-116 9577 0.0603 0.0639 One Sample … two.si…
## # … with abbreviated variable name ¹alternative
png("t11.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
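The eleven per-variable blocks above follow an identical recipe, so they could also be generated in a single pass by looping over the numeric columns and confidence levels. A base-R sketch, shown on a small simulated stand-in frame so it runs on its own (the real analysis would loop over the numeric columns of `loans` instead, and could use `broom::tidy()` as above to collect richer output):

```r
# Simulated stand-in for the loans data frame
set.seed(1)
loans_demo <- data.frame(int.rate    = rnorm(50, 0.12, 0.03),
                         installment = rnorm(50, 320, 200),
                         fico        = rnorm(50, 710, 35))

# One row per (variable, confidence level): estimate and CI bounds
ttest_tab <- do.call(rbind, lapply(names(loans_demo), function(v) {
  do.call(rbind, lapply(c(0.95, 0.99, 0.50), function(cl) {
    tt <- t.test(loans_demo[[v]], conf.level = cl)
    data.frame(variable  = v,
               conf.level = cl,
               estimate  = unname(tt$estimate),
               conf.low  = tt$conf.int[1],
               conf.high = tt$conf.int[2])
  }))
}))
ttest_tab
```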
ANOVA stands for Analysis of Variance. It is a statistical test that
was developed by Ronald Fisher in 1918 and has been in use ever since.
Put simply, ANOVA tells you whether there are statistically significant
differences between the means of three or more independent groups; one-way
ANOVA is the most basic form. Tukey's Honest Significant Difference (HSD)
test is a post hoc test commonly used to assess the significance of
differences between pairs of group means. Tukey HSD is often a follow-up
to one-way ANOVA, once the F-test has revealed a significant difference
between some of the tested groups. The purpose variable in our data is a
good candidate for these tests: we will run an ANOVA test for each numeric
variable based on purpose, followed by a Tukey HSD test. Again, not all of
the results may be useful, but we want them available for further
consideration.
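The ANOVA-then-Tukey recipe just described can be sketched on simulated data with three groups (the runs below apply the same recipe to the real `loans` data, grouping by `purpose`):

```r
# Minimal one-way ANOVA -> Tukey HSD sketch on simulated data:
# group "b" is constructed to have a higher mean than "a" and "c"
set.seed(3)
demo <- data.frame(
  y     = c(rnorm(40, 10), rnorm(40, 12), rnorm(40, 10.2)),
  group = factor(rep(c("a", "b", "c"), each = 40))
)

fit <- aov(y ~ group, data = demo)
summary(fit)    # F-test: do any group means differ?
TukeyHSD(fit)   # post hoc: which pairs of groups differ?
```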
aovrate=aov(int.rate ~ purpose, data = loans)
aovratesummary=summary(aovrate)
aovratesummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 0.351 0.05850 85.46 <2e-16 ***
## Residuals 9571 6.552 0.00068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovratetukey=TukeyHSD(aovrate)
aovratetukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = int.rate ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other 0.0029676657 0.0002712055 0.0056641258
## debt_consolidation-all_other 0.0098244685 0.0078099637 0.0118389734
## educational-all_other 0.0031367610 -0.0013252303 0.0075987522
## home_improvement-all_other 0.0007359906 -0.0027307042 0.0042026854
## major_purchase-all_other -0.0025995895 -0.0066215462 0.0014223673
## small_business-all_other 0.0213165483 0.0178278713 0.0248052252
## debt_consolidation-credit_card 0.0068568029 0.0043625116 0.0093510941
## educational-credit_card 0.0001690953 -0.0045290559 0.0048672465
## home_improvement-credit_card -0.0022316751 -0.0059974727 0.0015341226
## major_purchase-credit_card -0.0055672551 -0.0098497071 -0.0012848031
## small_business-credit_card 0.0183488826 0.0145628390 0.0221349262
## educational-debt_consolidation -0.0066877076 -0.0110305128 -0.0023449023
## home_improvement-debt_consolidation -0.0090884779 -0.0124003602 -0.0057765957
## major_purchase-debt_consolidation -0.0124240580 -0.0163133674 -0.0085347486
## small_business-debt_consolidation 0.0114920797 0.0081571947 0.0148269648
## home_improvement-educational -0.0024007703 -0.0075795444 0.0027780037
## major_purchase-educational -0.0057363504 -0.0113021266 -0.0001705743
## small_business-educational 0.0181797873 0.0129862726 0.0233733020
## major_purchase-home_improvement -0.0033355801 -0.0081404183 0.0014692582
## small_business-home_improvement 0.0205805576 0.0162123542 0.0249487611
## small_business-major_purchase 0.0239161377 0.0190954153 0.0287368602
## p adj
## credit_card-all_other 0.0201496
## debt_consolidation-all_other 0.0000000
## educational-all_other 0.3688491
## home_improvement-all_other 0.9959908
## major_purchase-all_other 0.4762263
## small_business-all_other 0.0000000
## debt_consolidation-credit_card 0.0000000
## educational-credit_card 0.9999999
## home_improvement-credit_card 0.5837844
## major_purchase-credit_card 0.0024341
## small_business-credit_card 0.0000000
## educational-debt_consolidation 0.0001153
## home_improvement-debt_consolidation 0.0000000
## major_purchase-debt_consolidation 0.0000000
## small_business-debt_consolidation 0.0000000
## home_improvement-educational 0.8194333
## major_purchase-educational 0.0383509
## small_business-educational 0.0000000
## major_purchase-home_improvement 0.3848074
## small_business-home_improvement 0.0000000
## small_business-major_purchase 0.0000000
aovinstallment=aov(installment ~ purpose, data = loans)
aovinstallmentsummary=summary(aovinstallment)
aovinstallmentsummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 33502096 5583683 141.7 <2e-16 ***
## Residuals 9571 377145527 39405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovinstallmenttukey=TukeyHSD(aovinstallment)
aovinstallmenttukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = installment ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other 74.563171 54.10481 95.021537
## debt_consolidation-all_other 114.046848 98.76256 129.331137
## educational-all_other -27.390341 -61.24400 6.463320
## home_improvement-all_other 92.134048 65.83182 118.436275
## major_purchase-all_other -1.453629 -31.96870 29.061438
## small_business-all_other 188.889066 162.42006 215.358074
## debt_consolidation-credit_card 39.483677 20.55919 58.408162
## educational-credit_card -101.953512 -137.59895 -66.308077
## home_improvement-credit_card 17.570877 -11.00068 46.142433
## major_purchase-credit_card -76.016800 -108.50828 -43.525325
## small_business-credit_card 114.325894 85.60073 143.051059
## educational-debt_consolidation -141.437189 -174.38657 -108.487806
## home_improvement-debt_consolidation -21.912800 -47.04045 3.214846
## major_purchase-debt_consolidation -115.500477 -145.00913 -85.991822
## small_business-debt_consolidation 74.842218 49.54005 100.144389
## home_improvement-educational 119.524389 80.23241 158.816366
## major_purchase-educational 25.936712 -16.29150 68.164920
## small_business-educational 216.279406 176.87559 255.683222
## major_purchase-home_improvement -93.587677 -130.04256 -57.132795
## small_business-home_improvement 96.755018 63.61294 129.897099
## small_business-major_purchase 190.342694 153.76730 226.918091
## p adj
## credit_card-all_other 0.0000000
## debt_consolidation-all_other 0.0000000
## educational-all_other 0.2045045
## home_improvement-all_other 0.0000000
## major_purchase-all_other 0.9999994
## small_business-all_other 0.0000000
## debt_consolidation-credit_card 0.0000000
## educational-credit_card 0.0000000
## home_improvement-credit_card 0.5388086
## major_purchase-credit_card 0.0000000
## small_business-credit_card 0.0000000
## educational-debt_consolidation 0.0000000
## home_improvement-debt_consolidation 0.1348022
## major_purchase-debt_consolidation 0.0000000
## small_business-debt_consolidation 0.0000000
## home_improvement-educational 0.0000000
## major_purchase-educational 0.5403655
## small_business-educational 0.0000000
## major_purchase-home_improvement 0.0000000
## small_business-home_improvement 0.0000000
## small_business-major_purchase 0.0000000
aovannual=aov(log.annual.inc ~ purpose, data = loans)
aovannualsummary=summary(aovannual)
aovannualsummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 163 27.224 75.38 <2e-16 ***
## Residuals 9571 3457 0.361
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovannualtukey=TukeyHSD(aovannual)
aovannualtukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log.annual.inc ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other 0.201917013 0.13998034 0.26385368
## debt_consolidation-all_other 0.067595668 0.02132325 0.11386808
## educational-all_other -0.295357330 -0.39784758 -0.19286708
## home_improvement-all_other 0.356665308 0.27703664 0.43629398
## major_purchase-all_other -0.000417996 -0.09280082 0.09196483
## small_business-all_other 0.300902905 0.22076931 0.38103650
## debt_consolidation-credit_card -0.134321345 -0.19161427 -0.07702842
## educational-credit_card -0.497274342 -0.60518910 -0.38935958
## home_improvement-credit_card 0.154748296 0.06824935 0.24124724
## major_purchase-credit_card -0.202335009 -0.30070131 -0.10396870
## small_business-credit_card 0.098985892 0.01202190 0.18594988
## educational-debt_consolidation -0.362952998 -0.46270559 -0.26320040
## home_improvement-debt_consolidation 0.289069640 0.21299696 0.36514232
## major_purchase-debt_consolidation -0.068013664 -0.15734963 0.02132230
## small_business-debt_consolidation 0.233307237 0.15670619 0.30990829
## home_improvement-educational 0.652022638 0.53306815 0.77097712
## major_purchase-educational 0.294939334 0.16709556 0.42278311
## small_business-educational 0.596260234 0.47696716 0.71555330
## major_purchase-home_improvement -0.357083304 -0.46744862 -0.24671798
## small_business-home_improvement -0.055762404 -0.15609839 0.04457358
## small_business-major_purchase 0.301320901 0.19059073 0.41205107
## p adj
## credit_card-all_other 0.0000000
## debt_consolidation-all_other 0.0003342
## educational-all_other 0.0000000
## home_improvement-all_other 0.0000000
## major_purchase-all_other 1.0000000
## small_business-all_other 0.0000000
## debt_consolidation-credit_card 0.0000000
## educational-credit_card 0.0000000
## home_improvement-credit_card 0.0000028
## major_purchase-credit_card 0.0000000
## small_business-credit_card 0.0139438
## educational-debt_consolidation 0.0000000
## home_improvement-debt_consolidation 0.0000000
## major_purchase-debt_consolidation 0.2714642
## small_business-debt_consolidation 0.0000000
## home_improvement-educational 0.0000000
## major_purchase-educational 0.0000000
## small_business-educational 0.0000000
## major_purchase-home_improvement 0.0000000
## small_business-home_improvement 0.6569306
## small_business-major_purchase 0.0000000
aovdti=aov(dti ~ purpose, data = loans)
aovdtisummary=summary(aovdti)
aovdtisummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 25645 4274 95.54 <2e-16 ***
## Residuals 9571 428200 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovdtitukey=TukeyHSD(aovdti)
aovdtitukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = dti ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other 3.01989971 2.3305501 3.709249320
## debt_consolidation-all_other 2.99696390 2.4819561 3.511971740
## educational-all_other 0.26542904 -0.8752783 1.406136404
## home_improvement-all_other -0.88199409 -1.7682541 0.004265877
## major_purchase-all_other -0.91961249 -1.9478251 0.108600126
## small_business-all_other -0.28620243 -1.1780821 0.605677282
## debt_consolidation-credit_card -0.02293582 -0.6606010 0.614729338
## educational-credit_card -2.75447067 -3.9555523 -1.553389051
## home_improvement-credit_card -3.90189381 -4.8646194 -2.939168245
## major_purchase-credit_card -3.93951220 -5.0343204 -2.844704021
## small_business-credit_card -3.30610214 -4.2740036 -2.338200706
## educational-debt_consolidation -2.73153485 -3.8417723 -1.621297380
## home_improvement-debt_consolidation -3.87895799 -4.7256402 -3.032275818
## major_purchase-debt_consolidation -3.91657638 -4.9108777 -2.922275050
## small_business-debt_consolidation -3.28316633 -4.1357292 -2.430603493
## home_improvement-educational -1.14742314 -2.4713759 0.176529618
## major_purchase-educational -1.18504153 -2.6079313 0.237848255
## small_business-educational -0.55163148 -1.8793527 0.776089726
## major_purchase-home_improvement -0.03761839 -1.2659745 1.190737746
## small_business-home_improvement 0.59579166 -0.5209389 1.712522180
## small_business-major_purchase 0.63341005 -0.5990069 1.865826982
## p adj
## credit_card-all_other 0.0000000
## debt_consolidation-all_other 0.0000000
## educational-all_other 0.9933845
## home_improvement-all_other 0.0520775
## major_purchase-all_other 0.1149573
## small_business-all_other 0.9649306
## debt_consolidation-credit_card 0.9999999
## educational-credit_card 0.0000000
## home_improvement-credit_card 0.0000000
## major_purchase-credit_card 0.0000000
## small_business-credit_card 0.0000000
## educational-debt_consolidation 0.0000000
## home_improvement-debt_consolidation 0.0000000
## major_purchase-debt_consolidation 0.0000000
## small_business-debt_consolidation 0.0000000
## home_improvement-educational 0.1399662
## major_purchase-educational 0.1757487
## small_business-educational 0.8845620
## major_purchase-home_improvement 1.0000000
## small_business-home_improvement 0.6995489
## small_business-major_purchase 0.7355365
aovfico=aov(fico ~ purpose, data = loans)
aovficosummary=summary(aovfico)
aovficosummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 477491 79582 57.14 <2e-16 ***
## Residuals 9571 13330261 1393
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovficotukey=TukeyHSD(aovfico)
aovficotukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = fico ~ purpose, data = loans)
##
## $purpose
## diff lwr upr p adj
## credit_card-all_other -5.717275 -9.5635101 -1.8710409 0.0002382
## debt_consolidation-all_other -11.472691 -14.3461837 -8.5991986 0.0000000
## educational-all_other -7.061260 -13.4258502 -0.6966688 0.0184988
## home_improvement-all_other 9.461983 4.5170846 14.4068814 0.0000004
## major_purchase-all_other 7.159374 1.4224493 12.8962990 0.0043919
## small_business-all_other 4.644633 -0.3316207 9.6208869 0.0858251
## debt_consolidation-credit_card -5.755416 -9.3132762 -2.1975551 0.0000384
## educational-credit_card -1.343984 -8.0454337 5.3574656 0.9970747
## home_improvement-credit_card 15.179258 9.8077194 20.5507975 0.0000000
## major_purchase-credit_card 12.876650 6.7681539 18.9851453 0.0000000
## small_business-credit_card 10.361909 4.9614906 15.7623265 0.0000003
## educational-debt_consolidation 4.411432 -1.7831520 10.6060152 0.3525562
## home_improvement-debt_consolidation 20.934674 16.2106006 25.6587477 0.0000000
## major_purchase-debt_consolidation 18.632065 13.0843488 24.1797818 0.0000000
## small_business-debt_consolidation 16.117324 11.3604394 20.8742090 0.0000000
## home_improvement-educational 16.523243 9.1362318 23.9102532 0.0000000
## major_purchase-educational 14.220634 6.2816026 22.1596647 0.0000027
## small_business-educational 11.705893 4.2978559 19.1139294 0.0000657
## major_purchase-home_improvement -2.302609 -9.1562370 4.5510193 0.9561484
## small_business-home_improvement -4.817350 -11.0481615 1.4134617 0.2537562
## small_business-major_purchase -2.514741 -9.3910264 4.3615443 0.9345421
aovcrline=aov(days.with.cr.line ~ purpose, data = loans)
aovcrlinesummary=summary(aovcrline)
aovcrlinesummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 7.136e+08 118941264 19.3 <2e-16 ***
## Residuals 9571 5.900e+10 6164006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovcrlinetukey=TukeyHSD(aovcrline)
aovcrlinetukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = days.with.cr.line ~ purpose, data = loans)
##
## $purpose
## diff lwr upr p adj
## credit_card-all_other 545.29979 289.42551 801.17407 0.0000000
## debt_consolidation-all_other 221.33099 30.16926 412.49272 0.0114478
## educational-all_other -303.11043 -726.52066 120.29980 0.3460038
## home_improvement-all_other 890.28941 561.32551 1219.25331 0.0000000
## major_purchase-all_other 14.26296 -367.39123 395.91714 0.9999998
## small_business-all_other 580.40963 249.35978 911.45947 0.0000050
## debt_consolidation-credit_card -323.96880 -560.65874 -87.27887 0.0010731
## educational-credit_card -848.41022 -1294.23030 -402.59014 0.0000004
## home_improvement-credit_card 344.98962 -12.35694 702.33618 0.0665993
## major_purchase-credit_card -531.03684 -937.41011 -124.66356 0.0022503
## small_business-credit_card 35.10984 -324.15792 394.37759 0.9999536
## educational-debt_consolidation -524.44141 -936.54177 -112.34106 0.0033319
## home_improvement-debt_consolidation 668.95842 354.68510 983.23175 0.0000000
## major_purchase-debt_consolidation -207.06803 -576.13496 161.99889 0.6465642
## small_business-debt_consolidation 359.07864 42.62252 675.53476 0.0144457
## home_improvement-educational 1193.39984 701.97218 1684.82750 0.0000000
## major_purchase-educational 317.37338 -210.77793 845.52470 0.5671043
## small_business-educational 883.52005 390.69362 1376.34649 0.0000026
## major_purchase-home_improvement -876.02645 -1331.97035 -420.08256 0.0000003
## small_business-home_improvement -309.87978 -724.39024 104.63067 0.2929493
## small_business-major_purchase 566.14667 108.69548 1023.59786 0.0049222
aovrbal=aov(revol.bal ~ purpose, data = loans)
aovrbalsummary=summary(aovrbal)
aovrbalsummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 2.114e+11 3.524e+10 31.52 <2e-16 ***
## Residuals 9571 1.070e+13 1.118e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovrbaltukey=TukeyHSD(aovrbal)
aovrbaltukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = revol.bal ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other 10296.9807 6850.81493 13743.1465
## debt_consolidation-all_other 4263.6707 1689.06653 6838.2750
## educational-all_other -2054.1419 -7756.71522 3648.4313
## home_improvement-all_other 4445.7169 15.16559 8876.2681
## major_purchase-all_other -5601.5868 -10741.78137 -461.3922
## small_business-all_other 14698.1637 10239.51843 19156.8089
## debt_consolidation-credit_card -6033.3100 -9221.09712 -2845.5228
## educational-credit_card -12351.1226 -18355.51622 -6346.7291
## home_improvement-credit_card -5851.2638 -10664.07848 -1038.4492
## major_purchase-credit_card -15898.5675 -21371.68365 -10425.4514
## small_business-credit_card 4401.1830 -437.50668 9239.8726
## educational-debt_consolidation -6317.8127 -11868.06227 -767.5631
## home_improvement-debt_consolidation 182.0461 -4050.64959 4414.7418
## major_purchase-debt_consolidation -9865.2576 -14835.92436 -4894.5907
## small_business-debt_consolidation 10434.4929 6172.39887 14696.5870
## home_improvement-educational 6499.8588 -118.78670 13118.5043
## major_purchase-educational -3547.4449 -10660.69193 3565.8022
## small_business-educational 16752.3056 10114.82106 23389.7901
## major_purchase-home_improvement -10047.3037 -16188.04681 -3906.5605
## small_business-home_improvement 10252.4468 4669.73747 15835.1561
## small_business-major_purchase 20299.7505 14138.70680 26460.7941
## p adj
## credit_card-all_other 0.0000000
## debt_consolidation-all_other 0.0000218
## educational-all_other 0.9389848
## home_improvement-all_other 0.0485656
## major_purchase-all_other 0.0223352
## small_business-all_other 0.0000000
## debt_consolidation-credit_card 0.0000005
## educational-credit_card 0.0000000
## home_improvement-credit_card 0.0062412
## major_purchase-credit_card 0.0000000
## small_business-credit_card 0.1027922
## educational-debt_consolidation 0.0139363
## home_improvement-debt_consolidation 0.9999996
## major_purchase-debt_consolidation 0.0000001
## small_business-debt_consolidation 0.0000000
## home_improvement-educational 0.0581190
## major_purchase-educational 0.7623958
## small_business-educational 0.0000000
## major_purchase-home_improvement 0.0000293
## small_business-home_improvement 0.0000013
## small_business-major_purchase 0.0000000
aovrutil=aov(revol.util ~ purpose, data = loans)
aovrutilsummary=summary(aovrutil)
aovrutilsummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 626354 104392 134.4 <2e-16 ***
## Residuals 9571 7435913 777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovrutiltukey=TukeyHSD(aovrutil)
aovrutiltukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = revol.util ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other 13.8881545 11.015499 16.7608104
## debt_consolidation-all_other 14.4131833 12.267044 16.5593226
## educational-all_other -0.9111547 -5.664707 3.8423980
## home_improvement-all_other -5.4376945 -9.130915 -1.7444743
## major_purchase-all_other -7.2544262 -11.539191 -2.9696613
## small_business-all_other 0.3581153 -3.358524 4.0747541
## debt_consolidation-credit_card 0.5250287 -2.132248 3.1823053
## educational-credit_card -14.7993093 -19.804453 -9.7941651
## home_improvement-credit_card -19.3258490 -23.337716 -15.3139816
## major_purchase-credit_card -21.1425807 -25.704862 -16.5802989
## small_business-credit_card -13.5300392 -17.563476 -9.4966029
## educational-debt_consolidation -15.3243380 -19.950917 -10.6977593
## home_improvement-debt_consolidation -19.8508778 -23.379170 -16.3225861
## major_purchase-debt_consolidation -21.6676094 -25.811059 -17.5241595
## small_business-debt_consolidation -14.0550680 -17.607866 -10.5022704
## home_improvement-educational -4.5265398 -10.043712 0.9906327
## major_purchase-educational -6.3432714 -12.272734 -0.4138089
## small_business-educational 1.2692700 -4.263606 6.8021463
## major_purchase-home_improvement -1.8167317 -6.935534 3.3020708
## small_business-home_improvement 5.7958098 1.142173 10.4494463
## small_business-major_purchase 7.6125415 2.476817 12.7482660
## p adj
## credit_card-all_other 0.0000000
## debt_consolidation-all_other 0.0000000
## educational-all_other 0.9977274
## home_improvement-all_other 0.0002872
## major_purchase-all_other 0.0000125
## small_business-all_other 0.9999573
## debt_consolidation-credit_card 0.9973079
## educational-credit_card 0.0000000
## home_improvement-credit_card 0.0000000
## major_purchase-credit_card 0.0000000
## small_business-credit_card 0.0000000
## educational-debt_consolidation 0.0000000
## home_improvement-debt_consolidation 0.0000000
## major_purchase-debt_consolidation 0.0000000
## small_business-debt_consolidation 0.0000000
## home_improvement-educational 0.1903721
## major_purchase-educational 0.0269245
## small_business-educational 0.9938750
## major_purchase-home_improvement 0.9430705
## small_business-home_improvement 0.0045156
## small_business-major_purchase 0.0002518
aov6mts=aov(inq.last.6mths ~ purpose, data = loans)
aov6mtssummary=summary(aov6mts)
aov6mtssummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 298 49.68 10.32 1.99e-11 ***
## Residuals 9571 46065 4.81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov6mtstukey=TukeyHSD(aov6mts)
aov6mtstukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = inq.last.6mths ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other -0.259023456 -0.485124080 -0.03292283
## debt_consolidation-all_other -0.185042944 -0.353960999 -0.01612489
## educational-all_other 0.207723759 -0.166418248 0.58186577
## home_improvement-all_other 0.294672824 0.003987328 0.58535832
## major_purchase-all_other -0.083574585 -0.420819301 0.25367013
## small_business-all_other 0.287260489 -0.005268232 0.57978921
## debt_consolidation-credit_card 0.073980512 -0.135168064 0.28312909
## educational-credit_card 0.466747215 0.072802982 0.86069145
## home_improvement-credit_card 0.553696280 0.237930742 0.86946182
## major_purchase-credit_card 0.175448872 -0.183638606 0.53453635
## small_business-credit_card 0.546283946 0.228820765 0.86374713
## educational-debt_consolidation 0.392766703 0.028618552 0.75691485
## home_improvement-debt_consolidation 0.479715768 0.202011443 0.75742009
## major_purchase-debt_consolidation 0.101468359 -0.224653755 0.42759047
## small_business-debt_consolidation 0.472303433 0.192670303 0.75193656
## home_improvement-educational 0.086949065 -0.347295824 0.52119396
## major_purchase-educational -0.291298343 -0.757993710 0.17539702
## small_business-educational 0.079536730 -0.355944176 0.51501764
## major_purchase-home_improvement -0.378247409 -0.781137446 0.02464263
## small_business-home_improvement -0.007412335 -0.373690148 0.35886548
## small_business-major_purchase 0.370835074 -0.033386867 0.77505701
## p adj
## credit_card-all_other 0.0129519
## debt_consolidation-all_other 0.0211591
## educational-all_other 0.6580146
## home_improvement-all_other 0.0444600
## major_purchase-all_other 0.9907182
## small_business-all_other 0.0581489
## debt_consolidation-credit_card 0.9439626
## educational-credit_card 0.0086671
## home_improvement-credit_card 0.0000049
## major_purchase-credit_card 0.7795481
## small_business-credit_card 0.0000082
## educational-debt_consolidation 0.0248091
## home_improvement-debt_consolidation 0.0000074
## major_purchase-debt_consolidation 0.9699015
## small_business-debt_consolidation 0.0000133
## home_improvement-educational 0.9971007
## major_purchase-educational 0.5203523
## small_business-educational 0.9982673
## major_purchase-home_improvement 0.0822557
## small_business-home_improvement 1.0000000
## small_business-major_purchase 0.0969396
aov2yrs=aov(delinq.2yrs ~ purpose, data = loans)
aov2yrssummary=summary(aov2yrs)
aov2yrssummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 1.4 0.2261 0.758 0.603
## Residuals 9571 2855.9 0.2984
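As a sanity check on where the table's p-value comes from: it is the upper-tail probability of the F distribution at the reported F value, with (6, 9571) degrees of freedom. A quick sketch:

```r
# ANOVA p-value = P(F > observed F) with (df1, df2) = (6, 9571)
p <- pf(0.758, df1 = 6, df2 = 9571, lower.tail = FALSE)
round(p, 3)  # ~0.603, matching the delinq.2yrs ~ purpose table above
```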
aov2yrstukey=TukeyHSD(aov2yrs)
aov2yrstukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = delinq.2yrs ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other -0.028404112 -0.08470194 0.02789372
## debt_consolidation-all_other -0.016496189 -0.05855587 0.02556349
## educational-all_other -0.022316777 -0.11547610 0.07084255
## home_improvement-all_other -0.043026219 -0.11540533 0.02935289
## major_purchase-all_other -0.005838136 -0.08981024 0.07813397
## small_business-all_other -0.024662327 -0.09750039 0.04817574
## debt_consolidation-credit_card 0.011907923 -0.04016894 0.06398478
## educational-credit_card 0.006087334 -0.09200264 0.10417731
## home_improvement-credit_card -0.014622108 -0.09324601 0.06400180
## major_purchase-credit_card 0.022565975 -0.06684486 0.11197681
## small_business-credit_card 0.003741785 -0.07530482 0.08278839
## educational-debt_consolidation -0.005820589 -0.09649150 0.08485032
## home_improvement-debt_consolidation -0.026530031 -0.09567690 0.04261684
## major_purchase-debt_consolidation 0.010658052 -0.07054458 0.09186069
## small_business-debt_consolidation -0.008166138 -0.07779327 0.06146099
## home_improvement-educational -0.020709442 -0.12883406 0.08741518
## major_purchase-educational 0.016478641 -0.09972597 0.13268325
## small_business-educational -0.002345549 -0.11077793 0.10608683
## major_purchase-home_improvement 0.037188083 -0.06312935 0.13750551
## small_business-home_improvement 0.018363893 -0.07283729 0.10956508
## small_business-major_purchase -0.018824190 -0.11947326 0.08182488
## p adj
## credit_card-all_other 0.7522736
## debt_consolidation-all_other 0.9101492
## educational-all_other 0.9922602
## home_improvement-all_other 0.5800892
## major_purchase-all_other 0.9999938
## small_business-all_other 0.9544771
## debt_consolidation-credit_card 0.9939822
## educational-credit_card 0.9999969
## home_improvement-credit_card 0.9980816
## major_purchase-credit_card 0.9897696
## small_business-credit_card 0.9999994
## educational-debt_consolidation 0.9999962
## home_improvement-debt_consolidation 0.9185513
## major_purchase-debt_consolidation 0.9997392
## small_business-debt_consolidation 0.9998646
## home_improvement-educational 0.9977371
## major_purchase-educational 0.9995917
## small_business-educational 1.0000000
## major_purchase-home_improvement 0.9303271
## small_business-home_improvement 0.9970089
## small_business-major_purchase 0.9980198
aovpubrec=aov(pub.rec ~ purpose, data = loans)
aovpubrecsummary=summary(aovpubrec)
aovpubrecsummary
## Df Sum Sq Mean Sq F value Pr(>F)
## purpose 6 1.1 0.18353 2.674 0.0136 *
## Residuals 9571 656.9 0.06864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aovpubrectukey=TukeyHSD(aovpubrec)
aovpubrectukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = pub.rec ~ purpose, data = loans)
##
## $purpose
## diff lwr upr
## credit_card-all_other 2.405972e-02 -0.002941177 0.05106061
## debt_consolidation-all_other 2.245991e-02 0.002287750 0.04263207
## educational-all_other -4.316270e-03 -0.048996238 0.04036370
## home_improvement-all_other 1.872461e-02 -0.015989000 0.05343821
## major_purchase-all_other 6.871860e-06 -0.040266829 0.04028057
## small_business-all_other 8.494763e-03 -0.026438962 0.04342849
## debt_consolidation-credit_card -1.599805e-03 -0.026576289 0.02337668
## educational-credit_card -2.837599e-02 -0.075420734 0.01866876
## home_improvement-credit_card -5.335110e-03 -0.043043772 0.03237355
## major_purchase-credit_card -2.405285e-02 -0.066935005 0.01882931
## small_business-credit_card -1.556495e-02 -0.053476348 0.02234644
## educational-debt_consolidation -2.677618e-02 -0.070262686 0.01671032
## home_improvement-debt_consolidation -3.735306e-03 -0.036898704 0.02942809
## major_purchase-debt_consolidation -2.245304e-02 -0.061398482 0.01649240
## small_business-debt_consolidation -1.396515e-02 -0.047358886 0.01942859
## home_improvement-educational 2.304088e-02 -0.028816567 0.07489832
## major_purchase-educational 4.323141e-03 -0.051409532 0.06005581
## small_business-educational 1.281103e-02 -0.039194016 0.06481608
## major_purchase-home_improvement -1.871774e-02 -0.066830788 0.02939532
## small_business-home_improvement -1.022984e-02 -0.053970672 0.03351098
## small_business-major_purchase 8.487891e-03 -0.039784217 0.05676000
## p adj
## credit_card-all_other 0.1177474
## debt_consolidation-all_other 0.0178038
## educational-all_other 0.9999567
## home_improvement-all_other 0.6884216
## major_purchase-all_other 1.0000000
## small_business-all_other 0.9916123
## debt_consolidation-credit_card 0.9999962
## educational-credit_card 0.5625641
## home_improvement-credit_card 0.9995971
## major_purchase-credit_card 0.6468635
## small_business-credit_card 0.8902906
## educational-debt_consolidation 0.5372859
## home_improvement-debt_consolidation 0.9998932
## major_purchase-debt_consolidation 0.6159793
## small_business-debt_consolidation 0.8813089
## home_improvement-educational 0.8474378
## major_purchase-educational 0.9999882
## small_business-educational 0.9910093
## major_purchase-home_improvement 0.9133346
## small_business-home_improvement 0.9932000
## small_business-major_purchase 0.9986018
The chi-square test of independence determines whether there is an association between two categorical variables, i.e., whether the variables are independent or related.
Chi-square test for purpose vs. credit policy
test = chisq.test(table(loans$purpose,loans$credit.policy))
test
##
## Pearson's Chi-squared test
##
## data: table(loans$purpose, loans$credit.policy)
## X-squared = 21.958, df = 6, p-value = 0.001232
test$observed
##
## FALSE TRUE
## all_other 496 1835
## credit_card 242 1020
## debt_consolidation 734 3223
## educational 89 254
## home_improvement 117 512
## major_purchase 66 371
## small_business 124 495
test$expected
##
## FALSE TRUE
## all_other 454.61558 1876.3844
## credit_card 246.12821 1015.8718
## debt_consolidation 771.73481 3185.2652
## educational 66.89539 276.1046
## home_improvement 122.67404 506.3260
## major_purchase 85.22823 351.7718
## small_business 120.72374 498.2763
test$residuals
##
## FALSE TRUE
## all_other 1.9409518 -0.9553797
## credit_card -0.2631365 0.1295217
## debt_consolidation -1.3583388 0.6686046
## educational 2.7026193 -1.3302894
## home_improvement -0.5122906 0.2521608
## major_purchase -2.0828002 1.0252006
## small_business 0.2981822 -0.1467719
corrplot(test$residuals, is.cor = FALSE)
Since the p-value for the test of “purpose” against “credit policy” is less than our chosen significance level of α = 0.05, we reject the null hypothesis of independence. We conclude that there is enough evidence to suggest an association between “purpose” and “credit policy”.
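The residuals printed above are Pearson residuals, (observed − expected) / √expected, which is what corrplot visualizes: cells with large positive residuals are over-represented relative to independence. A minimal sketch of the arithmetic on a small hypothetical table (the same applies to the purpose-by-credit.policy table):

```r
# Hypothetical 2x3 contingency table standing in for table(loans$purpose, loans$credit.policy)
tab <- matrix(c(30, 10, 20, 40, 15, 25), nrow = 2,
              dimnames = list(group = c("A", "B"), flag = c("x", "y", "z")))
test <- chisq.test(tab)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
# Pearson residuals: (observed - expected) / sqrt(expected)
resid <- (tab - expected) / sqrt(expected)

all.equal(unname(expected), unname(test$expected))  # TRUE
all.equal(unname(resid), unname(test$residuals))    # TRUE
```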
Chi-square test for purpose vs. not fully paid
test = chisq.test(table(loans$purpose,loans$not.fully.paid))
test
##
## Pearson's Chi-squared test
##
## data: table(loans$purpose, loans$not.fully.paid)
## X-squared = 96.985, df = 6, p-value < 2.2e-16
test$observed
##
## FALSE TRUE
## all_other 1944 387
## credit_card 1116 146
## debt_consolidation 3354 603
## educational 274 69
## home_improvement 522 107
## major_purchase 388 49
## small_business 447 172
test$expected
##
## FALSE TRUE
## all_other 1957.9134 373.08655
## credit_card 1060.0115 201.98852
## debt_consolidation 3323.6652 633.33483
## educational 288.1014 54.89862
## home_improvement 528.3259 100.67415
## major_purchase 367.0563 69.94373
## small_business 519.9264 99.07361
test$residuals
##
## FALSE TRUE
## all_other -0.3144402 0.7203274
## credit_card 1.7196643 -3.9394502
## debt_consolidation 0.5261783 -1.2053825
## educational -0.8307855 1.9031843
## home_improvement -0.2752124 0.6304635
## major_purchase 1.0931697 -2.5042608
## small_business -3.1982603 7.3266552
corrplot(test$residuals, is.cor = FALSE)
Since the p-value for the test of “purpose” against “not fully paid” is less than our chosen significance level of α = 0.05, we reject the null hypothesis of independence. We conclude that there is enough evidence to suggest an association between “purpose” and “not fully paid”.
Chi-square test for credit policy vs. not fully paid
test = chisq.test(table(loans$credit.policy,loans$not.fully.paid))
test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(loans$credit.policy, loans$not.fully.paid)
## X-squared = 238.38, df = 1, p-value < 2.2e-16
test$observed
##
## FALSE TRUE
## FALSE 1349 519
## TRUE 6696 1014
test$expected
##
## FALSE TRUE
## FALSE 1569.019 298.9814
## TRUE 6475.981 1234.0186
test$residuals
##
## FALSE TRUE
## FALSE -5.554504 12.724399
## TRUE 2.734051 -6.263232
corrplot(test$residuals, is.cor = FALSE)
Since the p-value for the test of “credit policy” against “not fully paid” is less than our chosen significance level of α = 0.05, we reject the null hypothesis of independence. We conclude that there is enough evidence to suggest an association between “credit policy” and “not fully paid”.
We want to create a Q-Q plot for each numeric variable so we can assess its normality. Note that the easiest way to interpret the findings is to check how closely the data follow the black reference line, which represents the normal distribution for that variable.
# By gathering the variables we want to see into a long format with the gather() function, we can then create a Q-Q plot
# for each variable using the facet_wrap() function in ggplot2.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec) %>%
gather(variable, value) %>%
ggplot(aes(sample = value)) +
geom_qq(color = "steelblue") +
geom_qq_line() +
facet_wrap(~ variable, scales = "free") + # Free scales so the graphs are readable
labs(title = "Q-Q Plots of Numeric Variables", x = "Theoretical", y = "Sample") +
theme_minimal()
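To complement the visual check with a formal test, Shapiro-Wilk can be applied to each column; note that shapiro.test() accepts at most 5000 observations, so a random sample is tested. The sketch below uses simulated stand-in columns; with the real data, `df` would be the selected loans columns.

```r
set.seed(1)
# Simulated stand-ins: one right-skewed column (like revol.bal) and one symmetric one
df <- data.frame(skewed = rexp(9578), symmetric = rnorm(9578))

# shapiro.test() is limited to 5000 observations, so test a random sample of each column
pvals <- sapply(df, function(col) shapiro.test(sample(col, 5000))$p.value)
pvals["skewed"] < 0.05  # TRUE: normality is clearly rejected for the skewed column
```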
Risks to our analysis and opportunities for future analyses:
Private individuals historically made up the bulk of lenders in P2P markets. However, high interest rates and the prospects of risky borrowers undermined P2P lending as a legitimate financial industry. Combined with the urge for more growth by intermediaries like LendingClub, these concerns began to prompt higher lending standards and discussions about more regulation.
By 2017, shortly after the peak of the P2P industry, larger institutions and banks began to overtake private individuals as the primary sources of lending in P2P markets. We suspect this shift in P2P lenders altered the makeup of who receives what, rendering recent research on P2P loans as an investment opportunity less reliable as a guide for today’s prospective individual investors.
While the annual income data was given to us as a log, the revolving
balance was given to us unmodified. We discovered that taking the log of
the revol.bal variable yields a distribution that looks more normal, but
we encountered an issue: some loans have a revol.bal value of 0, and
taking the log of 0 gives -Inf. We will need to decide how to handle
this in the future. For now, we want to demonstrate how taking the log
of revol.bal improves the readability of the data and makes the variable
resemble a normal distribution, which is a desirable characteristic for
further modelling.
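One possible workaround, sketched here but not adopted in our analysis, is log1p(), which computes log(1 + x) and therefore maps a zero balance to 0 instead of -Inf, while barely differing from log() for large balances:

```r
bal <- c(0, 100, 33667)   # hypothetical revol.bal values, including a zero balance
log(bal)[1]               # -Inf: the problem described above
log1p(bal)[1]             # 0: zeros stay finite
# For large balances the two transforms are nearly identical:
abs(log1p(bal[3]) - log(bal[3]))  # on the order of 3e-5
```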
loans$revol.bal=log(loans$revol.bal)
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_histogram(fill = "steelblue", color = "black") +
facet_wrap(~ variable, scales = "free") + # Free scales so the graphs are readable
labs(title = "Histograms of Numeric Variables", x = "Value", y = "Count") +
theme_minimal()
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_boxplot(fill = "steelblue", color = "black",
outlier.size = 2, outlier.alpha = 0.2) + # Translucent and larger outliers to help with overplotting
facet_wrap(~ variable, scales = "free") + # Free scales so the graphs are readable
labs(title = "Boxplots of Numeric Variables", x = "Value") +
theme_minimal() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
# By gathering the variables we want to see into a long format with the gather() function, we can then create a boxplot
# for each variable using the facet_wrap() function in ggplot2. We can see this for each credit policy value by excluding
# it in the gather() function.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec, credit.policy) %>%
gather(variable, value, -credit.policy) %>%
ggplot(aes(x = value, y = as.logical(credit.policy), fill = as.logical(credit.policy))) +
geom_boxplot(outlier.size = 2, outlier.alpha = 0.2) + # Translucent and larger outliers to help with overplotting
guides(fill = guide_legend(reverse = TRUE)) + # So the legend order matches the order in the graphs
facet_wrap(~ variable, scales = "free_x") + # Free x scale so the graphs are readable
labs(title = "Boxplots of Numeric Variables", subtitle = "Comparing `credit.policy` Values",
x = "Value", y = "Count", fill = "Credit Policy") +
theme_minimal()
# By gathering the variables we want to see into a long format with the gather() function, we can then create a boxplot
# for each variable using the facet_wrap() function in ggplot2. We can see this for each not fully paid value by excluding
# it in the gather() function.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec, not.fully.paid) %>%
gather(variable, value, -not.fully.paid) %>%
ggplot(aes(x = value, y = as.logical(not.fully.paid), fill = as.logical(not.fully.paid))) +
geom_boxplot(outlier.size = 2, outlier.alpha = 0.2) + # Translucent and larger outliers to help with overplotting
guides(fill = guide_legend(reverse = TRUE)) + # So the legend order matches the order in the graphs
facet_wrap(~ variable, scales = "free_x") + # Free x scale so the graphs are readable
labs(title = "Boxplots of Numeric Variables", subtitle = "Comparing `not.fully.paid` Values",
x = "Value", y = "Count", fill = "Not Fully Paid") +
theme_minimal()
# By gathering the variables we want to see into a long format with the gather() function, we can then create a boxplot
# for each variable using the facet_wrap() function in ggplot2. We can see this for each purpose value by excluding
# it in the gather() function.
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec, purpose) %>%
gather(variable, value, -purpose) %>%
ggplot(aes(x = value, y = purpose, fill = purpose)) +
geom_boxplot(outlier.size = 2, outlier.alpha = 0.2) +
guides(fill = guide_legend(reverse = TRUE)) + # So the legend order matches the order in the graphs
facet_wrap(~ variable, scales = "free_x") + # Free x scale so the graphs are readable
labs(title = "Boxplots of Numeric Variables", subtitle = "Comparing `purpose` Values",
x = "Value", y = "Count", fill = "Purpose") +
theme_minimal()
loans %>%
select(int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, revol.util,
inq.last.6mths, delinq.2yrs, pub.rec) %>%
gather(variable, value) %>%
ggplot(aes(sample = value)) +
geom_qq(color = "steelblue") +
geom_qq_line() +
facet_wrap(~ variable, scales = "free") + # Free scales so the graphs are readable
labs(title = "Q-Q Plots of Numeric Variables", x = "Theoretical", y = "Sample") +
theme_minimal()
We performed z-interval tests for each variable, but decided that t-intervals were more appropriate. We are keeping the code here so that we don’t lose the work. Note that a z-test is not strictly appropriate in this case because we do not know the standard deviation of the population; we retain these tests to better understand how the z-statistic works and how it compares to the t-statistic.
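For intuition on why the choice matters little at this sample size: the two intervals differ only in the quantile (normal vs. t) and in whether σ is assumed or estimated, and with n in the thousands qt(0.975, n − 1) is essentially qnorm(0.975) ≈ 1.96. A sketch on simulated data (the real comparison would use a loans column such as int.rate):

```r
set.seed(7)
x <- rnorm(9578, mean = 0.12, sd = 0.03)   # simulated stand-in for loans$int.rate
n <- length(x); m <- mean(x); s <- sd(x)

# z interval: plugs the (here, estimated) sigma into a normal quantile
z_ci <- m + c(-1, 1) * qnorm(0.975) * s / sqrt(n)
# t interval: same form, but with a t quantile on n - 1 degrees of freedom
t_ci <- m + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)

# The t interval is (very slightly) wider; at this n the difference is negligible
max(abs(z_ci - t_ci)) < 1e-5  # TRUE
```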
loans$revol.bal=exp(loans$revol.bal)
# This code will perform the z-interval tests we want, but we will show the results in a nicer looking table format
# For these z-interval tests we assume the data are normal and, purely for illustration, that the population standard deviation is 2.31
loadPkg("BSDA")
ztest95rate = z.test(x=loans$int.rate, sigma.x = sd(loans$int.rate)) # default conf.level = 0.95
ztest99rate = z.test(x=loans$int.rate, sigma.x = 2.31, conf.level=0.99 )
ztest50rate = z.test(x=loans$int.rate, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95rate, ztest99rate, ztest50rate), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternat…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.123 447. 0 0.122 0.123 One-sample z-Test two.sided
## 2 0.123 5.20 0.000000204 0.0618 0.183 One-sample z-Test two.sided
## 3 0.123 5.20 0.000000204 0.107 0.139 One-sample z-Test two.sided
## # … with abbreviated variable name ¹alternative
png("z1.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95installment = z.test(x=loans$installment, sigma.x = 2.31) # default conf.level = 0.95
ztest99installment = z.test(x=loans$installment, sigma.x = 2.31, conf.level=0.99 )
ztest50installment = z.test(x=loans$installment, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95installment,ztest99installment,ztest50installment), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 319. 13519. 0 319. 319. One-sample z-Test two.sided
## 2 319. 13519. 0 319. 319. One-sample z-Test two.sided
## 3 319. 13519. 0 319. 319. One-sample z-Test two.sided
png("z2.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95annual = z.test(x=loans$log.annual.inc, sigma.x = 2.31) # default conf.level = 0.95
ztest99annual = z.test(x=loans$log.annual.inc, sigma.x = 2.31, conf.level=0.99 )
ztest50annual = z.test(x=loans$log.annual.inc, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95annual,ztest99annual,ztest50annual), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 10.9 463. 0 10.9 11.0 One-sample z-Test two.sided
## 2 10.9 463. 0 10.9 11.0 One-sample z-Test two.sided
## 3 10.9 463. 0 10.9 10.9 One-sample z-Test two.sided
png("z3.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95fico = z.test(x=loans$fico, sigma.x = 2.31) # default conf.level = 0.95
ztest99fico = z.test(x=loans$fico, sigma.x = 2.31, conf.level=0.99 )
ztest50fico = z.test(x=loans$fico, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95fico,ztest99fico,ztest50fico), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 711. 30116. 0 711. 711. One-sample z-Test two.sided
## 2 711. 30116. 0 711. 711. One-sample z-Test two.sided
## 3 711. 30116. 0 711. 711. One-sample z-Test two.sided
png("z4.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95dti = z.test(x=loans$dti, sigma.x = 2.31) # default conf.level = 0.95
ztest99dti = z.test(x=loans$dti, sigma.x = 2.31, conf.level=0.99 )
ztest50dti = z.test(x=loans$dti, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95dti,ztest99dti,ztest50dti), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 12.6 534. 0 12.6 12.7 One-sample z-Test two.sided
## 2 12.6 534. 0 12.5 12.7 One-sample z-Test two.sided
## 3 12.6 534. 0 12.6 12.6 One-sample z-Test two.sided
png("z5.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95days.with.cr.line = z.test(x=loans$days.with.cr.line, sigma.x = 2.31) # default conf.level = 0.95
ztest99days.with.cr.line = z.test(x=loans$days.with.cr.line, sigma.x = 2.31, conf.level=0.99 )
ztest50days.with.cr.line = z.test(x=loans$days.with.cr.line, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95days.with.cr.line,ztest99days.with.cr.line,ztest50days.with.cr.line), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 4561. 193225. 0 4561. 4561. One-sample z-Test two.sided
## 2 4561. 193225. 0 4561. 4561. One-sample z-Test two.sided
## 3 4561. 193225. 0 4561. 4561. One-sample z-Test two.sided
png("z6.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95revol.bal = z.test(x=loans$revol.bal, sigma.x = 2.31) # default conf.level = 0.95
ztest99revol.bal = z.test(x=loans$revol.bal, sigma.x = 2.31, conf.level=0.99 )
ztest50revol.bal = z.test(x=loans$revol.bal, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95revol.bal,ztest99revol.bal,ztest50revol.bal), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 16914. 716590. 0 16914. 16914. One-sample z-Test two.sided
## 2 16914. 716590. 0 16914. 16914. One-sample z-Test two.sided
## 3 16914. 716590. 0 16914. 16914. One-sample z-Test two.sided
png("z7.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95revol.util = z.test(x=loans$revol.util, sigma.x = 2.31) # default conf.level = 0.95
ztest99revol.util = z.test(x=loans$revol.util, sigma.x = 2.31, conf.level=0.99 )
ztest50revol.util = z.test(x=loans$revol.util, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95revol.util,ztest99revol.util,ztest50revol.util), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 46.8 1983. 0 46.8 46.8 One-sample z-Test two.sided
## 2 46.8 1983. 0 46.7 46.9 One-sample z-Test two.sided
## 3 46.8 1983. 0 46.8 46.8 One-sample z-Test two.sided
png("z8.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95inq.last.6mths = z.test(x=loans$inq.last.6mths, sigma.x = 2.31) # default conf.level = 0.95
ztest99inq.last.6mths = z.test(x=loans$inq.last.6mths, sigma.x = 2.31, conf.level=0.99 )
ztest50inq.last.6mths = z.test(x=loans$inq.last.6mths, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95inq.last.6mths,ztest99inq.last.6mths,ztest50inq.last.6mths), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1.58 66.8 0 1.53 1.62 One-sample z-Test two.sided
## 2 1.58 66.8 0 1.52 1.64 One-sample z-Test two.sided
## 3 1.58 66.8 0 1.56 1.59 One-sample z-Test two.sided
png("z9.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95delinq.2yrs = z.test(x=loans$delinq.2yrs, sigma.x = 2.31)# default conf.level = 0.95
ztest99delinq.2yrs = z.test(x=loans$delinq.2yrs, sigma.x = 2.31, conf.level=0.99 )
ztest50delinq.2yrs = z.test(x=loans$delinq.2yrs, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95delinq.2yrs,ztest99delinq.2yrs,ztest50delinq.2yrs), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.164 6.94 4.04e-12 0.117 0.210 One-sample z-Test two.sided
## 2 0.164 6.94 4.04e-12 0.103 0.225 One-sample z-Test two.sided
## 3 0.164 6.94 4.04e-12 0.148 0.180 One-sample z-Test two.sided
png("z10.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
ztest95pub.rec = z.test(x=loans$pub.rec, sigma.x = 2.31) # default conf.level = 0.95
ztest99pub.rec = z.test(x=loans$pub.rec, sigma.x = 2.31, conf.level=0.99 )
ztest50pub.rec = z.test(x=loans$pub.rec, sigma.x = 2.31, conf.level=0.50 )
tab <- map_df(list(ztest95pub.rec,ztest99pub.rec,ztest50pub.rec), tidy)
tab
## # A tibble: 3 × 7
## estimate statistic p.value conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.0621 2.63 0.00849 0.0159 0.108 One-sample z-Test two.sided
## 2 0.0621 2.63 0.00849 0.00132 0.123 One-sample z-Test two.sided
## 3 0.0621 2.63 0.00849 0.0462 0.0780 One-sample z-Test two.sided
png("z11.png", height=100, width=700)
p<-tableGrob(tab)
grid.arrange(p)
dev.off()
## quartz_off_screen
## 2
Back to original
Peng, Roger D., and Elizabeth Matsui. “Chapter 4.” The Art of Data Science: A Guide for Anyone Who Works with Data, Skybrude Consulting LLC, Victoria, 2016, p. 33.
Our code requires the tidyverse and ezids libraries.
https://towardsdatascience.com/end-to-end-case-study-classification-lending-club-data-489f8a1b100a
https://towardsdatascience.com/turning-lending-clubs-worst-loans-into-investment-gold-475ec97f58ee
https://www.debt.org/credit/loans/personal/lending-club-review/
https://nycdatascience.com/blog/student-works/lendingclub-profitability-predictions/
https://nycdatascience.com/blog/student-works/project-1-analysis-of-lending-clubs-data/
https://michel-kana.medium.com/lendingclub-bias-in-data-machine-learning-and-investment-strategy-3a3bd1c65f0